The purpose of this notebook is to explore the weatherAUS dataset, which is used to predict whether it will rain in Australia. Most of the analysis relies on visualization, and machine learning is used for the prediction. The resulting accuracy is 84%.

Description of the columns:
Date: The date of observation
Location: The common name of the location of the weather station
MinTemp: The minimum temperature in degrees Celsius
MaxTemp: The maximum temperature in degrees Celsius
Rainfall: The amount of rainfall recorded for the day in mm
Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine: The number of hours of bright sunshine in the day
WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am: Direction of the wind at 9am
WindDir3pm: Direction of the wind at 3pm
WindSpeed9am: Wind speed (km/h) averaged over 10 minutes prior to 9am
WindSpeed3pm: Wind speed (km/h) averaged over 10 minutes prior to 3pm
Humidity9am: Humidity (percent) at 9am
Humidity3pm: Humidity (percent) at 3pm
Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am
Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm
Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas", a unit of eighths: it records how many eighths of the sky are obscured by cloud. A value of 0 indicates a completely clear sky, while 8 indicates a completely overcast sky
Cloud3pm: Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm
Temp9am: Temperature (degrees C) at 9am
Temp3pm: Temperature (degrees C) at 3pm
RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0
RainTomorrow: The response variable, indicating whether it rains the next day. It is derived from the amount of next-day rain in mm, a kind of measure of the "risk"
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from scipy import stats
from sklearn.model_selection import train_test_split
import plotly.express as px
%matplotlib inline
# Loading the dataset
df = pd.read_csv('weatherAUS.csv')
df.set_index('Date', inplace=True)
# Checking the first 5 rows to spot missing data; many values in the dataset are missing
df.head(5)
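The first rows only hint at how much data is missing. A minimal sketch of one way to quantify it per column is shown here; the helper name `missing_summary` and the toy frame are assumptions made for illustration, and the toy values are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy frame with made-up values, only to demonstrate the helper
toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "Sunshine": [np.nan, np.nan, np.nan, 8.1],
    "Rainfall": [0.6, 0.0, 0.0, 1.0],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook itself, the same helper could be called as `missing_summary(df)` once the dataset is loaded.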
# Checking columns
df.columns
df.values
# checking shape
df.shape
df.info()
# counting the values in the RainToday column
df['RainToday'].value_counts()
Exploratory Data Analysis
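# correlation between the numeric columns
# (non-numeric columns such as Location are excluded, since recent pandas versions raise an error on them)
corrmat = df.select_dtypes(include='number').corr()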
plt.subplots(figsize=(16,16))
sns.heatmap(corrmat,annot=True,square=True)
As the plot shows, some variables are strongly correlated with each other. For example, Rainfall correlates with MinTemp, Cloud, and humidity: the lower the temperature and the higher the humidity, the higher the chance of rain.
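Reading individual cells off a large heatmap is error-prone, so it can help to rank the correlations with a single column numerically. The sketch below is one possible way to do that; the helper name `rank_correlations` is an assumption, and the toy values are invented so that the signs are easy to see, not measurements from the dataset.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlations of every numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength but keep the sign of each coefficient
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy frame with made-up values, only to demonstrate the helper
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 2.0, 5.0, 10.0],
    "Humidity3pm": [30.0, 35.0, 60.0, 75.0, 90.0],
    "MaxTemp": [35.0, 33.0, 25.0, 22.0, 18.0],
    "Location": ["A", "B", "A", "B", "A"],
})
ranked = rank_correlations(toy, "Rainfall")
print(ranked)
```

On the real data this would be called as `rank_correlations(df, "Rainfall")`, which makes it straightforward to check how strong each relationship seen in the heatmap actually is.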
# distribution of the RainTomorrow values
sns.countplot(data=df, x='RainTomorrow', palette='Pastel1')
# the 20 locations with the highest mean daily rainfall
df[['Location','Rainfall']].groupby('Location').mean().sort_values(by='Rainfall', ascending=False).iloc[:20]
# scatter plot of humidity at 9am against rainfall (trendline="ols" requires statsmodels)
fig = px.scatter(df, x="Humidity9am", y="Rainfall",
                 marginal_x="box", marginal_y="violin",
                 trendline="ols", template="simple_white")
fig.show()
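The OLS trendline in the scatter plot is an ordinary least-squares fit of Rainfall on Humidity9am. As a rough sketch of the same idea outside Plotly, the slope and intercept can be estimated with `np.polyfit` after dropping rows with missing values. The helper name `ols_line` and the toy values are assumptions for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

def ols_line(frame: pd.DataFrame, x: str, y: str):
    """Fit y = slope * x + intercept by least squares, ignoring rows with NaN."""
    clean = frame[[x, y]].dropna()
    slope, intercept = np.polyfit(clean[x], clean[y], deg=1)
    return float(slope), float(intercept)

# Toy frame with made-up values; the row with a missing humidity is dropped
toy = pd.DataFrame({
    "Humidity9am": [40.0, 50.0, np.nan, 70.0, 90.0],
    "Rainfall": [1.0, 2.0, 3.0, 4.0, 6.0],
})
slope, intercept = ols_line(toy, "Humidity9am", "Rainfall")
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With the real data, `ols_line(df, "Humidity9am", "Rainfall")` gives the change in rainfall (mm) per percentage point of humidity, which is easier to compare across variables than a line in a plot.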